Most time in data analysis is spent ‘tidying up’ data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: Tidy datasets are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row” (Wickham 2014).
It’s often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.
Overview
In previous worksheets (e.g. here and here) we used various commands in the tidyverse, like filter and group_by.
If you want to recap these commands you can start to use the cheatsheet, especially the part on groups and summaries.
Today we will cover three additional techniques which are important when working with real datasets:
1, Creating new variables/columns 2. ‘Pivoting’ or reshaping data from long to wide formats (or the reverse) 3. Joining two sources of data (e.g. two spreadsheets) into a single dataframe
Before you start
- Make sure you complete the worksheet on exploring data and plotting
- Create a new R Script file to save your workings
- Remember to load the tidyverse
library(tidyverse)Making new variables
Sometimes we need to create new columns in our dataset.
For example, let’s say we wanted to calculate someone’s body mass index (BMI) from their weight and height.
There is a built-in dataset called women, which contains heights and weights of 15 women in lbs and inches:
women %>% glimpse
> Rows: 15
> Columns: 2
> $ height <dbl> 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72
> $ weight <dbl> 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 15…Explanation of the output We used glimpse in the block above. Glimpse is a bit like head, except it shows you a list of variables, with as many examples of the data as will fit on screen. The head command is the other way around — it show columns along the top, and the first few rows of data vertically. Either can be useful to quickly check what is in a dataset, but I often prefer glimpse if there are a lot of variables in a file because all of the column names are shown.
To calculate a BMI we first need to convert the heights to from inches to meters, and the weights from lbs to kilograms.
metric_women <- women %>%
mutate(
height_m = height*0.0254, # approx conversion from inches to m
weight_kg= weight*0.45 # conversion from lbs to kg
)Explanation: We used the mutate function to covert lbs to kg and inches to m, and saved these two new columns in a new dataset called metric_women.
We can see the new columns here:
metric_women %>% head(3)
> height weight height_m weight_kg
> 1 58 115 1.4732 51.75
> 2 59 117 1.4986 52.65
> 3 60 120 1.5240 54.00BMI is calculated as \(\dfrac{kg}{m^2}\). We can use mutate again to create a new column to store this calculation:
metric_women %>%
mutate(BMI = weight_kg / height_m^2) %>%
head(3)
> height weight height_m weight_kg BMI
> 1 58 115 1.4732 51.75 23.84443
> 2 59 117 1.4986 52.65 23.44374
> 3 60 120 1.5240 54.00 23.25005Explanation: We used mutate again to make a new column, BMI. This contains women’s weight divided by their squared height (^2 means to the power of 2, or squared, in R-speak).
Create a density plot of BMI scores in the women dataset. It should look like this:
What is the median BMI in the sample?
Hints:
- You will need to use
mutateto create a new column containing BMI scores - When you create a dataset with this new column, either save the result with a new variable name and pipe that to
ggplot, or pipe the result directly intoggplot. Either approach is fine. - We covered density plots in the first worksheet, and they are also in the cheatsheet.
Pivoting (reshaping)
Data is commonly stored in either wide or long format.
If you used SPSS to do a t-test or ANOVA during your undergraduate degree, you likely stored and analysed the data in wide format.
In wide format, each row represents the observations from a single participant. Each measurement for a given participant are stored in separate columns.
This is often called row per subject data. An example is the built in attitude dataset:
attitude %>%
head()
> rating complaints privileges learning raises critical advance
> 1 43 51 30 39 61 92 45
> 2 63 64 51 54 63 73 47
> 3 71 70 68 69 76 86 48
> 4 61 63 45 47 54 84 35
> 5 81 78 56 66 71 83 47
> 6 43 55 49 44 54 49 34Explanation: Each row contains scores for a particular employee on various measures. To find out more about these data you can type ?attitude into the console.
Let’s say we want a single plot of all these variables, something like this:
To do this we first need to convert the data to long format. In long format, each observation is saved in its own row, rather than across multiple columns.
It’s often called “row per observation” data.
Pivoting is where you take a long data file (lots of rows, few columns) and make it wider. Or where you take a wide data file (lots of columns, few rows) and make it longer.
We can convert from wide to long using the pivot_longer command.
Another term sometimes used for pivoting data to long form is melting it. img: TrueWarrior
To see why the command is called ‘pivot_longer’, imagine trying to reshape just the first two rwos of the attitude dataset:
> rating complaints privileges learning raises critical advance
> 1 43 51 30 39 61 92 45
> 2 63 64 51 54 63 73 47
If we use pivot_longer on this selection, we end up with this:
attitude %>%
pivot_longer(everything())
> # A tibble: 210 x 2
> name value
> <chr> <dbl>
> 1 rating 43
> 2 complaints 51
> 3 privileges 30
> 4 learning 39
> 5 raises 61
> 6 critical 92
> 7 advance 45
> 8 rating 63
> 9 complaints 64
> 10 privileges 51
> # … with 200 more rowsExplanation of the command: We selected a subset of columns and rows. Then we used pivot_longer(everything()) to make this into long form data. The everything() tells R to merge values from all of the columns into a single new column called value, and to keep track of the original variable name in a new column called name)
The change works like like this:
Converting from wide format to long format
One problem with this example we don’t have a record of which participant was which in the attitude dataset, because the mapping to participants was implicit: each row was a new participant, and participant number was not recorded in the file.
We can create an explicit participant identifier by adding a new column. For this we use the mutate and row_number() commands:
attitude_with_person <- attitude %>%
mutate(person = row_number()) %>%
head(2)
attitude_with_person
> rating complaints privileges learning raises critical advance person
> 1 43 51 30 39 61 92 45 1
> 2 63 64 51 54 63 73 47 2Now we have a column called person which stores the row nuber.
But this means if we pivot_longer again, we will need to tell R which columns we would like to pivot.
If we don’t do this then the person column gets melted with everything else so we lose track of which response belonged to which participant:
attitude_with_person %>%
pivot_longer(everything())
> # A tibble: 16 x 2
> name value
> <chr> <dbl>
> 1 rating 43
> 2 complaints 51
> 3 privileges 30
> 4 learning 39
> 5 raises 61
> 6 critical 92
> 7 advance 45
> 8 person 1
> 9 rating 63
> 10 complaints 64
> 11 privileges 51
> 12 learning 54
> 13 raises 63
> 14 critical 73
> 15 advance 47
> 16 person 2Explanation of the output Because we didn’t tell pivot_longer which columns we wanted to pivot, it put all the values into a single new column called value. This included our participant identifier, person which is not what we wanted.
We can exclude person from the pivoting by writing:
attitude_with_person %>%
pivot_longer(-person) %>%
head()
> # A tibble: 6 x 3
> person name value
> <int> <chr> <dbl>
> 1 1 rating 43
> 2 1 complaints 51
> 3 1 privileges 30
> 4 1 learning 39
> 5 1 raises 61
> 6 1 critical 92Explanation of the command and output: Here, we still use pivot_longer but this time we put -person between the parentheses. The minus sign, -, means don’t include this variable, so -person ends up meaning include all columns except person, which is what we wanted. The output now retains the person column, but pivots the other variables. This means we can tell which person provided each datapoint.
Use some tidyverse commands you already know, plus pivot_longer, to produce this plot using the attitude dataset:
- Check the cheatsheet if you are not sure how to add a boxplot layer (hint, it starts with
geom_, so you might be able to guess or autocomplete it without looking it up) - You need to select only the three variables shown first
- It’s not necessary to creat a person identifier for this plot (although it won’t hurt if you do)
Pivoting data to make summaries
Imagine we want a table of the mean score for each question in the attitude dataset.
This would be fiddly if we just tried to use summarise on wide format data. But if we use pivot_longer, group_by and then summarise (in that order) it’s possible to take the data and make a table like this with 3 instructions to R:
> # A tibble: 7 x 3
> Name Mean SD
> <chr> <dbl> <dbl>
> 1 advance 42.9 10.3
> 2 complaints 66.6 13.3
> 3 critical 74.8 9.89
> 4 learning 56.4 11.7
> 5 privileges 53.1 12.2
> 6 raises 64.6 10.4
> 7 rating 64.6 12.2
Combine the pivot_longer, group_by and summarise commands (in that order) to reproduce the table above.
Hints:
- You want to pivot all of the variables in the attitude dataset this time
- We covered using
summarisein the first and second worksheets. If you want a reminder, the cheatsheet might be the quickest place to look
Pivoting wider
Sometimes we have the opposite problem: We have long data, but want it in wide format. For example, we might want a table where it’s easy to compare between different years, like this:
gapminder::gapminder %>%
filter(year > 1990) %>%
pivot_wider(id_cols=country,
names_from=year,
values_from=gdpPercap) %>%
head(3) %>%
pander::pander("GDP per-capita in 3 countries in 3 different years, from the gapminder dataset.")| country | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|
| Afghanistan | 649.3 | 635.3 | 726.7 | 974.6 |
| Albania | 2497 | 3193 | 4604 | 5937 |
| Algeria | 5023 | 4797 | 5288 | 6223 |
Instead of making the data longer, now we want to pivot_wider.
As we saw before, the gapminder data is a fairly long format. There are multiple rows per-country corresponding to different years.
Let’s say we want to compare GDP in different years. We first need to select the data we want — country, year and GDP:
gapminder1990s <- gapminder::gapminder %>%
select(country, year, gdpPercap) %>%
filter(year >= 1990)Then we pivot_wider:
gapminder1990s %>%
pivot_wider(names_from = year, values_from = gdpPercap) %>%
head()
> # A tibble: 6 x 5
> country `1992` `1997` `2002` `2007`
> <fct> <dbl> <dbl> <dbl> <dbl>
> 1 Afghanistan 649. 635. 727. 975.
> 2 Albania 2497. 3193. 4604. 5937.
> 3 Algeria 5023. 4797. 5288. 6223.
> 4 Angola 2628. 2277. 2773. 4797.
> 5 Argentina 9308. 10967. 8798. 12779.
> 6 Australia 23425. 26998. 30688. 34435.Explanation of the command and output: We started with multiple rows per country, corresponding to years. We used pivot_wider with names_from = year to create new columns for each year in the data. We used values_from=gdpPercap to tell pivot_longer to use the GDP numbers to populate the table. The resulting table helps us compare years within countries, or between countries for a given year.
Experiment for yourself with pivot_longer and pivot_wider:
- Try using pivot_longer with some different datasets, perhaps
irisandmtcars - Use the pivoted data with
group_byto create summary tables
If you have time, now try the extension exercises
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.